## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
# Univariate Analysis
This is a clean dataset published by Cortez et al. in 2009. The only variable that contains values of zero is “citric.acid”, which is plausible and can be trusted.
Given my limited knowledge about red wine, my experience suggests that alcohol content, acidity (pH), and the amount of sulphates may have a real impact on the quality of a particular wine. Additional features that may be of interest are chlorides, citric.acid, and residual.sugar. We will also explore the relationships among volatile.acidity, fixed.acidity, and the sulfur dioxide variables. Later in the exploration, we may also create bins for more qualitative analysis.
Lastly, I created a binary field for “good” and “bad” wines. Here I am briefly stepping outside of my role as a data explorer and into my role as a potential consumer of red wines. As a consumer, I am less interested in predicting the exact value of the quality variable (i.e., whether a wine is a 3 or a 7) and more interested in the likelihood that a wine will be good. I therefore binned the quality variable into a binary variable so that I can later perform a logistic regression or fit a decision tree for classification. This will allow me, the consumer, to narrow a list of candidate wines for my next party down to those most likely to be “good.” To me, this is much more useful!
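The binning step just described can be sketched in a few lines of R. The cutoff (quality above 5 is “good”) follows the description later in this report, but the data frame name `wine` is an assumption:

```r
# Sketch of the binary label, assuming the data frame is named `wine`
# (the real script's object name may differ).
wine <- data.frame(quality = c(3L, 5L, 5L, 6L, 7L, 8L))

# Wines rated above 5 are labeled "good" (1); 5 or below are "bad" (0).
wine$good.bad <- as.integer(wine$quality > 5)

table(wine$good.bad)
```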
I was very surprised that pH did not show a strong distinction between the “good” wines and the “bad” wines. Looking at the quality variable, higher-rated wines did appear slightly more acidic, but this relationship was not nearly as strong as I assumed. I was equally surprised that residual.sugar did not have as strong an effect on “good” versus “bad” wines either.
It was informative to notice that the boxplots for citric.acid and volatile.acidity showed a relatively strong distinction between the “good” and “bad” wines. Based on my prior (limited) knowledge of wines, I would not have known the importance of these features. I do, however, know about the importance of alcohol content, most likely because common cultural lingo often refers to wines and spirits by their “percent of alcohol.” I noticed that the “good” wines tend to have higher alcohol content, for that stronger, fuller-bodied taste.
I am most interested in the boxplot comparing the median alcohol values between “good” and “bad” wines. The median alcohol value among the “good” wines was considerably higher than that of the “bad” wines. However, the alcohol content of the “bad” wines was also more volatile, with greater variance between the maximum and minimum values.
##
## Calls:
## logm1: glm(formula = good.bad ~ alcohol, family = binomial, data = wine_analysis)
## logm2: glm(formula = good.bad ~ alcohol + volatile.acidity, family = binomial,
## data = wine_analysis)
## logm3: glm(formula = good.bad ~ alcohol + volatile.acidity + sulphates,
## family = binomial, data = wine_analysis)
## logm4: glm(formula = good.bad ~ alcohol + volatile.acidity + sulphates +
## chlorides, family = binomial, data = wine_analysis)
## logm5: glm(formula = good.bad ~ alcohol + volatile.acidity + sulphates +
## pH, family = binomial, data = wine_analysis)
##
## ============================================================================
## logm1 logm2 logm3 logm4 logm5
## ----------------------------------------------------------------------------
## (Intercept) -10.763*** -8.344*** -9.720*** -9.341*** -9.310***
## (0.683) (0.726) (0.784) (0.794) (1.429)
## alcohol 1.056*** 1.005*** 0.997*** 0.950*** 1.003***
## (0.067) (0.069) (0.069) (0.070) (0.071)
## volatile.acidity -3.541*** -3.122*** -2.981*** -3.090***
## (0.355) (0.365) (0.368) (0.376)
## sulphates 1.873*** 2.517*** 1.852***
## (0.369) (0.437) (0.375)
## chlorides -4.297**
## (1.434)
## pH -0.143
## (0.418)
## ----------------------------------------------------------------------------
## Aldrich-Nelson R-sq. 0.177 0.222 0.233 0.236 0.233
## McFadden R-sq. 0.156 0.207 0.219 0.224 0.219
## Cox-Snell R-sq. 0.194 0.248 0.261 0.266 0.261
## Nagelkerke R-sq. 0.258 0.332 0.349 0.355 0.349
## phi 1.000 1.000 1.000 1.000 1.000
## Likelihood-ratio 343.939 456.470 484.482 493.959 484.599
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -932.517 -876.251 -862.245 -857.507 -862.187
## Deviance 1865.034 1752.503 1724.491 1715.014 1724.374
## AIC 1869.034 1758.503 1732.491 1725.014 1734.374
## BIC 1879.788 1774.634 1753.999 1751.900 1761.260
## N 1599 1599 1599 1599 1599
## ============================================================================
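For reference, a model like `logm3` above can be refit with `glm()`; the side-by-side comparison resembles the output of `memisc::mtable()`, though that package is an assumption. A minimal, self-contained sketch on synthetic data (the real `wine_analysis` frame is not reproduced here; ranges loosely follow the summary statistics reported earlier):

```r
set.seed(1)

# Synthetic stand-in for the real `wine_analysis` data frame (an assumption).
wine_analysis <- data.frame(
  alcohol          = runif(500, 8.4, 14.9),
  volatile.acidity = runif(500, 0.12, 1.58),
  sulphates        = runif(500, 0.33, 2.00)
)
# Generate a label that truly depends on alcohol, mirroring the fitted signs.
wine_analysis$good.bad <- rbinom(500, 1, plogis(-10 + 1.0 * wine_analysis$alcohol))

# Mirrors the logm3 call reported above.
logm3 <- glm(good.bad ~ alcohol + volatile.acidity + sulphates,
             family = binomial, data = wine_analysis)

coef(logm3)
# A comparison table in the layout above could then be produced with,
# e.g., memisc::mtable(logm1, logm2, logm3) -- package choice is an assumption.
```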
##
## 0 1
## 744 855
## predictions
## 0 1
## 788 811
##
## predictions 0 1
## 0 550 238
## 1 194 617
## [1] 0.7298311
##
## tree.predictions 0 1
## 0 598 253
## 1 146 602
## [1] 0.750469
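The two accuracy figures above can be recovered directly from the printed confusion matrices: correct predictions sit on the diagonal, divided by the 1,599 observations.

```r
# Confusion matrices as printed above (rows = predictions, columns = actual).
logit_cm <- matrix(c(550, 194, 238, 617), nrow = 2)
tree_cm  <- matrix(c(598, 146, 253, 602), nrow = 2)

# Accuracy = correctly classified / total.
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

accuracy(logit_cm)  # 1167 / 1599, the 0.7298311 above
accuracy(tree_cm)   # 1200 / 1599, the 0.750469 above
```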
Again, alcohol content, even in relation to other feature variables, was the strongest predictor of quality. I was most interested when alcohol was plotted in conjunction with sulphates, volatile.acidity, density, and chlorides, as those combinations appeared to further sharpen the decision boundary between “good” and “bad” wines.
Yes, I created a few logistic regression models as well as a decision tree model. First, it must be stated that this is not a true machine learning exercise: no attention was given to tuning, cross-validation, or pruning/penalization. The logistic regression models were useful for seeing which feature variables really affected the binary “good”/“bad” outcome. This is useful to me as a consumer of wine, especially since I am often limited in the information I can gather in a very short period of time at the store. However, as a human, I also cannot easily evaluate a logistic regression model in my head at the store, so I decided to also use a decision tree model. The decision tree is much more helpful for my next trip to the wine store; moreover, it captures interactions between the features that the logistic regression did not.
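The decision tree itself is not reproduced here; fitting one is usually done with the `rpart` package (whether the original analysis used `rpart` is an assumption). A minimal sketch on synthetic data, where a simple alcohol-by-sulphates rule generates the label so the tree has an interaction to find:

```r
library(rpart)  # recommended package, ships with standard R distributions
set.seed(2)

# Synthetic stand-in data (an assumption); ranges loosely follow the summaries above.
toy <- data.frame(
  alcohol          = runif(300, 8.4, 14.9),
  sulphates        = runif(300, 0.33, 2.00),
  volatile.acidity = runif(300, 0.12, 1.58)
)
toy$good.bad <- factor(as.integer(toy$alcohol > 10.5 & toy$sulphates > 0.6))

# Classification tree over the same features the report highlights.
tree_fit <- rpart(good.bad ~ alcohol + sulphates + volatile.acidity,
                  data = toy, method = "class")

# In-sample accuracy, analogous to the 0.750469 reported above.
tree_pred <- predict(tree_fit, type = "class")
mean(tree_pred == toy$good.bad)
```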
This boxplot places “bad” and “good” wines next to each other with their respective descriptive statistics for alcohol content. “Good” wines appeared to have higher alcohol content, though the alcohol levels of “bad” wines are more dispersed.
I found the interaction between alcohol content and sulphates to be most interesting in delineating between “bad” and “good” wines. The scatterplot as well as the density plot highlight this differentiation.
Though not strictly an exploratory plot, this decision tree shows the importance of the various features as well as a guide for purchasing wine. Furthermore, it highlights the interaction between alcohol, sulphates, and volatile acidity.
The Wine Quality dataset is a tidy dataset of 1,599 observations, made public for research by P. Cortez et al. I intentionally did not read the paper “Using Data Mining for Wine Quality Assessment” until after the final project, so as to prevent bias and to fully grapple with the dataset on my own. I started by understanding the distribution of wine quality in the dataset. I then created a binary variable labeling wines as “good” or “bad” based simply on whether the quality was greater than 5 or less than or equal to 5. I then explored various features, as well as combinations of features, as they relate to the “good”/“bad” label. I was most surprised that chlorides, pH, residual sugar, and citric acid did not have a stronger relationship to the quality label. Ultimately, I created a few logistic regression models as well as a decision tree model using all observations in the dataset.

There are quite a few limitations to this approach. First, I simplified the quality label into only two classes; as a consumer of wines, I felt this was sufficient for my next trip to the store. Furthermore, the dataset is based on sensory judging, which may vary significantly from individual to individual. Nonetheless, the decision tree may provide a simple guide for deciding on the next wine to try. It would also be helpful to have datasets from different time periods or different judges to see how “accurate” the models remain.